image/svg+xml How many random fragments are enough to obtain the original DNA sequence ? ? ? ? How many random fragments are enough to obtain the original DNA sequence ? ? ? ? G = Length of the genome L = Read length N = Number of reads G=26 L=7 N=5 How many random fragments are enough to obtain the original DNA sequence ? ? ? ? G = Length of the genome L = Read length N = Number of reads What is the minimum number of reads required to cover all the sequence? As the fragments are randomly distributed,it is unlikely that the minimum wil be enough! G=26 L=13 L=13 N=2 How many random fragments are enough to obtain the original DNA sequence ? ? ? ? G = Length of the genome L = Read length N = Number of reads Let's look at the coverage per nucleotide instead. The average coverage per nucleotide is: What is the probability of a read with size L covers 1 nucleotide in the sequence of size G? G=26 L=13 What is the probability of 1 read with size L covers 1 nucleotide in the sequence of size G? What is the probability of N (4) reads covers 1 nucleotide exactly x (2) times? G=26 L=13 p is small: size of the read (L=100 bases) is very small compared to the size of the sequence (G=23 billion bases) It's a Binomial distribution! The Binomial distribution converges towards the Poisson distribution when p tends to zero: ? ? ? G =Length of the genome L=Read length N=Number of reads How many random fragments are enough to obtain the original DNA sequence ? G=26 L=7 N=5 0.0 1.0 2.0 3.0 4.0 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Coverage Probability more than 25% of the genome not covered What is the probability of N reads with size L covers 1 nucleotide in the sequence of size Gexactly x times? read size genome size How to obtain the original DNA sequence from its fragments? ? ? ? GRAPHS! How to obtain the original DNA sequence from its fragments? ? ? ? GRAPHS! How to obtain the original DNA sequence from its fragments? ? ? ? Ideas for solving our problem, as well as the entire branch of mathematics known today as graph theory, can be traced back 300 years ago to this city: Königsberg (present-day Kaliningrad, Russia) At the time, Königsberg’s residents enjoyed ­strollingthrough their city, and they w­ondered if every part of the city could be visited by walking across each of the seven bridges exactly once and returning to one’s starting ­location. 7 bridges city has 4 parts separated by the Pregel River ! Leonhard Euler made a conceptual breakthrough Euler’s first insight was to represent each landmass as a point (called a node) and each bridge as a line segment (called an edge) ­ This creates a graph, a network of nodes connected by edges. Is it possible to visit every part of the city bywalking across each of the seven bridges exactly once? ! Is it possible to visit every part of the city bywalking across each of the seven bridges exactly once? When it is possible, we call Eulerian cycle (or path) NO! In 1946 the Dutch mathematician Nicolaas de Bruijn became interested in the superstring problem Bridges of Königsberg problem Bridges of Königsberg problem Bridges of Königsberg problem Bridges of Königsberg problem Bridges of Königsberg problem 200 years later... ? ? ? angel gel elf fish angelfish In 1946 the Dutch mathematician Nicolaas de Bruijn became interested in the superstring problem 200 years later... ? ? ? angel gel elf fish angelfish All words have k letters (k-mers)... Constrained version: ...but it can be any alphabet Superstring problem - Example ? ? ? Alphabet: 0 and 1 Input: All possible 3-mers 000 011 0001110100 010 001 100 111 110 101 Output: ! ! ! ! Superstring problem De Bruijn answered this question by borrowing Euler’s solution of the Bridges of Königsberg problem Every possible (k-1)-mer is assigned to a node: 110 11 10 Two nodes are connected by a directed edge if there is a word whose prefix is the former and whose suffix is the latter prefix suffix Superstring problem - Example Alphabet: 0 and 1 Input: All possible 3-mers 000 011 ? 010 001 100 111 110 101 Output: ! ! ! 11 10 01 00 000 111 110 001 100 011 101 010 Superstring problem - Example Alphabet: 0 and 1 Input: All possible 3-mers 000 011 010 001 100 111 110 101 Output: ! ! ! 11 10 01 00 000 111 110 001 100 011 101 010 1 2 3 4 5 6 7 8 circular superstring(Eulerian circuit) 00011101 50 years later... De Bruijn graphs! How to obtain the original DNA sequence from its fragments? Can you assemble your DNAwith De Bruijn graphs? Our alphabet: Input: reads (k-mers) Output: DNA sequence (Eulerian circuit) 11 10 01 00 000 111 110 001 100 011 101 010 1 2 3 4 5 6 7 8 00011101 Now that we have access to the information in our DNA, how can we use it? 3 How can we use it? Changes (mutations) accumulate over time What is the probability of N (4) reads covers 1 nucleotide exactly x (2) times? It's a Binomial distribution! read size genome size And what about your DNA? What is the probability of one base not being covered? Genome assembly - Challenges Repeats! 2 2 2 2 Genome assembly - Software ...and errors during reading! SOAPdenovo Velvet ALLPATHS ABySS Pavel Pevzner (University of California, San Diego) connecting the appropriate two points.
1
  1. IntroCovQuestion
  2. MinNbReads
  3. MinUnlikely
  4. Coverage
  5. ProbOneReadCovers
  6. ProbOneReadCovers2
  7. BinomialXreadsCover1Base
  8. pSmall
  9. PoissonDistr
  10. PoissonDistrEx
  11. ActivityCov
  12. IntroAssembly
  13. IntroAssebGraph
  14. GraphIntro
  15. KonigsbergProblem
  16. GraphConcepts
  17. KonigsbergSolve
  18. KonigsbergSolution
  19. SuperStringIntro
  20. SuperStringConstrained
  21. SuperStringExample
  22. DeBruijnGraph
  23. DeBruijnGraphExample
  24. DeBruijnGraphOutput
  25. AssemblyDeBruijn
  26. ActivityAssemble
  27. Repeats
  28. SoftwarePevzner